This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday October 31. I will accept late submissions with a penalty until the meetup after that when we review some projects.

# Import required R libraries
library(fpp3)
library(tidyverse)
library(readxl)
library(seasonal)
library(stringr)

Part A – ATM Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable Cash is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also please submit the forecast which you will put in an Excel readable file.

# Read in data
atm_data_raw <- read_excel("data/ATM624Data.xlsx")

# Properly convert the DATE column to match true input
# https://stackoverflow.com/questions/43230470/how-to-convert-excel-date-format-to-proper-date-in-r
atm_data_raw <- atm_data_raw %>% mutate(DATE = as.Date(DATE, origin = "1899-12-30"))

# Initial output to see data
#head(atm_data_raw)

# Output summary for high level assessment
summary(atm_data_raw)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19
# Check dimensions to understand breadth of data
dim(atm_data_raw)
## [1] 1474    3
# 1474    3
#Cash in hundreds

Dates of dataset start at 2009-05-01 and end with 2010-05-14. This indicates 379 dates, which is 14 more than a single year of 365 days.

Dimensions of the initial dataset indicate 1474 observations, which again indicates more than a single year’s worth of data. The 3 columns are DATE, ATM, and Cash. The DATE indicates the specific data, the ATM indicates which of the 4 ATMs, and Cash represents the total amount withdrawn for the given date and ATM.

# Define as tsibble with DATE as index
atm_data_ts <- atm_data_raw %>%
  as_tsibble(index = DATE, key = ATM)

# Output tsibble to confirm proper format and definition
atm_data_ts %>%
  filter(DATE > "2010-04-30")
## # A tsibble: 14 x 3 [1D]
## # Key:       ATM [1]
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2010-05-01 <NA>     NA
##  2 2010-05-02 <NA>     NA
##  3 2010-05-03 <NA>     NA
##  4 2010-05-04 <NA>     NA
##  5 2010-05-05 <NA>     NA
##  6 2010-05-06 <NA>     NA
##  7 2010-05-07 <NA>     NA
##  8 2010-05-08 <NA>     NA
##  9 2010-05-09 <NA>     NA
## 10 2010-05-10 <NA>     NA
## 11 2010-05-11 <NA>     NA
## 12 2010-05-12 <NA>     NA
## 13 2010-05-13 <NA>     NA
## 14 2010-05-14 <NA>     NA

A closer look at the data shows 14 observations starting with date 2010-05-01 and ending on 2010-05-14 do not indicate an ATM nor a Cash amount, thus these 14 observations will be ignored in the upcoming evaluation. The removal of thse 14 observations also makes for a cleaner dataset, as now the total observations is 1460, which is exacting 365 for each of the 4 ATMs.

Plot all data

atm_data_ts %>%
  autoplot(Cash) +
  labs(y = "Cash (in hundreds $USD)", title = "ATM Withdrawls")

An initial plot of the time series shows the ATM4 data falls in a range greater than the other three ATMs. With the extreme outlier in ATM4, the plot does not provide much value for comparison across the four ATMS.

ATM1

# Filter to ATM1 data
atm1_data_ts <- atm_data_ts %>%
  filter(ATM == 'ATM1')

# Output summary
summary(atm1_data_ts)
##       DATE                ATM                 Cash       
##  Min.   :2009-05-01   Length:365         Min.   :  1.00  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 73.00  
##  Median :2009-10-30   Mode  :character   Median : 91.00  
##  Mean   :2009-10-30                      Mean   : 83.89  
##  3rd Qu.:2010-01-29                      3rd Qu.:108.00  
##  Max.   :2010-04-30                      Max.   :180.00  
##                                          NA's   :3

The summary of data for ATM1 confirms the 365 days (single year) of observations and also indicates 3 missing Cash values.

atm1_data_ts$DATE[is.na(atm1_data_ts$Cash)]
## [1] "2009-06-13" "2009-06-16" "2009-06-22"

The 3 dats with missing Cash values are displayed above. Not sure the significance of the missing data occurring in a small window of time, so will consider imputation.

atm1_data_ts %>%
  autoplot(Cash) +
  labs(y = "Cash (in hundreds $USD)", title = "ATM1")

Initial plot of the ATM1 data appears to show a weekly seasonal component along with a dip between October 2009 and January 2010. There does not appear to be a clear increasing or decreasing trend.

Impute Missing Data

Given the low volume of missing values, three, I will simply impute the data with the median value of the ATM1 dataset.

# Calculate median value for ATM1
median <- median(atm1_data_ts$Cash, na.rm=TRUE)

# Set NAs to median
atm1_data_ts$Cash[is.na(atm1_data_ts$Cash)] <- median

summary(atm1_data_ts)
##       DATE                ATM                 Cash       
##  Min.   :2009-05-01   Length:365         Min.   :  1.00  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 73.00  
##  Median :2009-10-30   Mode  :character   Median : 91.00  
##  Mean   :2009-10-30                      Mean   : 83.95  
##  3rd Qu.:2010-01-29                      3rd Qu.:108.00  
##  Max.   :2010-04-30                      Max.   :180.00

Summary output confirms no missing data for ATM1 Cash amounts.

Simply updating the ATM1 data tsibble is defined properly.

atm1_data_ts <- atm1_data_ts %>%
  mutate(DATE = as_date(DATE)) %>%
  update_tsibble(index = DATE)
#atm1_data_ts

Decomposition

To better understand the trend and seasonal nature of the ATM1 dataset, decomposition is performed.

# From: https://stats.stackexchange.com/questions/494013/control-the-period-for-daily-time-series-in-tsibbles
atm1_dcmp <- atm1_data_ts %>%
  model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))) 

atm1_dcmp %>% components() %>% autoplot()

STL decomposition shows the remainder plot has the most impact on the data based on the gray bar on the left. The seasonal plot shows a clear seasonal pattern, but while the seasonal pattern remains the same, the remainder plot shows greater variance toward the end of plot, beginning in February 2010. The trend line confirms no real trend present in the dataset.

components(atm1_dcmp) %>%
  as_tsibble() %>%
  filter_index("2009-05-01" ~ "2010-04-30") %>%
  autoplot(Cash, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")

The above seasonally adjusted trendline confirms that despite a few outliers through December 2009, the seasonal component does well to address the weekly nature of the data, but the plot above indicates starting in February 2010, the seasonal component does not properly define the dataset. Based on the above plot, a new weekly pattern appears to emerge in February 2010.

components(atm1_dcmp) %>%
  as_tsibble() %>%
  filter_index("2010-02-16" ~ "2010-04-30") %>%
  autoplot(Cash, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")

Above plot shows that the seasonally adjusted data does not account for the weekly seasonal nature for the last three months. Clearly a shift in the weekly seasonal pattern has occurred.

atm1_data_ts %>% ACF() %>% autoplot()

The ACF of the ATM1 data shows a clear correlation at lags 7, 14, and 21, which is expected for seasonal data following a weekly pattern.

atm1_data_ts %>% PACF() %>% autoplot()

The PACF plot re-confirms the correlations at lag 7 and 14. Interesting, correlation also appears at lags 2 and 5.

atm1_data_ts %>% gg_season(Cash, period = "week") +
  labs(y="$USD (in hundreds)", title="ATM Withdrawals")

To better understand the weekly seasonal pattern, the gg_season plot above does show some sort of shift on Tuesday and Thursday.

atm1_data_ts %>%
  gg_subseries(period = "week") +
  labs(
    y = "$USD (in hundreds)",
    title = "ATM Withdrawals"
  )

The `gg_subseries plot confirms the shift in February occurs on Tuesday, Wednesday, and Thursday. The other days of the week indicate some variance and outliers, but nothing as dramatic as Tuesday through Thursday.

Train and Test Model

Before forecasting the data for May 2010, I want to train and test the different model functions for proper evaluation. The models are trained on 292 days, or approximately 80% of the year represented.

# Create training set (assume 1 year of data)
# 292 days for training, approx. 80% of data
train_atm1 <- atm1_data_ts %>% 
  filter_index("2009-05-01" ~ "2010-02-17")

#train_atm1

# Fit the models
fit_atm1 <- train_atm1 %>%
  model(
    Naive = NAIVE(Cash),
    `Seasonal naive` = SNAIVE(Cash),
    `Random walk` = RW(Cash),
    Arima = ARIMA(Cash),
    ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
    ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
    ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
  )

fit_atm1 %>% accuracy()
## # A tibble: 7 × 11
##   ATM   .model         .type       ME  RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr> <chr>          <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 ATM1  Naive          Training 0.284  49.9  37.4 -85.0 120.  2.07  1.77  -0.356  
## 2 ATM1  Seasonal naive Training 0.203  28.1  18.0 -83.3 103.  1     1      0.0957 
## 3 ATM1  Random walk    Training 0.284  49.9  37.4 -85.0 120.  2.07  1.77  -0.356  
## 4 ATM1  Arima          Training 0.179  21.9  14.3 -66.5  82.7 0.790 0.780  0.00341
## 5 ATM1  ETS_Add        Training 0.175  21.4  13.7 -73.3  89.7 0.761 0.763  0.0940 
## 6 ATM1  ETS_Mult       Training 1.98   21.6  13.7 -75.8  90.6 0.760 0.769  0.0956 
## 7 ATM1  ETS_Damp       Training 1.46   21.3  13.3 -76.4  90.4 0.737 0.759  0.0756

Using the different model functions from the Hyndman textbook, the best performing model based on RMSE is the ETS model with dampening. Overall, the three ETS models and the ARIMA model all perform well in comparison. The seasonal naive model also performe well with an RMSE slightly higher than the ARIMA model and the three ETS models.

# Generate forecasts for 72 days
fc_atm1 <- fit_atm1 %>% forecast(h = "72 days")

fc_atm1 %>% accuracy(atm1_data_ts)
## # A tibble: 7 × 11
##   .model         ATM   .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr>          <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 Arima          ATM1  Test   -2.20  50.8  37.8 -466.  502.  2.09  1.81 -0.0565 
## 2 ETS_Add        ATM1  Test   -3.25  52.4  38.0 -479.  514.  2.11  1.86  0.00348
## 3 ETS_Damp       ATM1  Test  -10.4   54.2  40.0 -517.  547.  2.22  1.93 -0.0246 
## 4 ETS_Mult       ATM1  Test   -7.03  52.9  39.2 -494.  527.  2.17  1.88 -0.00808
## 5 Naive          ATM1  Test  -98.2  104.   98.2 -893.  893.  5.44  3.71 -0.0585 
## 6 Random walk    ATM1  Test  -98.2  104.   98.2 -893.  893.  5.44  3.71 -0.0585 
## 7 Seasonal naive ATM1  Test   -5.21  64.1  53.7 -460.  509.  2.97  2.28  0.00878

The accuracy of the models on the test dataset shows ARIMA performing the best with an RMSE of 50.77. The three ETS models do also perform well.

# Plot forecasts against actual values
fc_atm1 %>%
  autoplot(filter_index(train_atm1, "2010-02-01" ~ "2010-04-30"), level = NULL) +
  autolayer(
    filter_index(atm1_data_ts, "2010-02-10" ~ "2010-04-30"),
    colour = "black"
  ) +
  labs(
    y = "Cash (in hundreds $USD)",
    title = "Forecasts for ATM1 Withdrawals"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

So plotting the seven forecasts along with the original dataset shows the seasonal pattern of the forecasts does not match the seasonal pattern of the actual dataset. Based on the observations in the decomposition section, the weekly pattern has changed, and thus training the model on the first 80% of the data actually misses the new pattern found in the data. Further investigation is needed.

fit_atm1_arima <- train_atm1 %>%
  model(ARIMA(Cash))

# Check the residuals of Seasonal Naive
fit_atm1_arima %>% gg_tsresiduals()

To further confirm the assumption of a bad model, I’ve taken the ARIMA model, the best performing based on RMSE, and displayed the residuals above. The ACF plot of the residuals clearly indicates a correlation exists at lags 8 and 20. This indicates the model isn’t quite capturing the correlation of the test data properly.

Train and Test Model: Second Attempt

Given the decomposition and the model evaluations indicate a change in seasonal pattern sometime in February 2010, I will attempt to train and test on just the data beginning on 2010-02-16. Yes, this certainly reduces the amount of data used to build the model, but I believe may better capture the change in the weekly pattern and thus provide better forecasts for the month of May 2010.

# ATM1 train on 2010-02-16 to 2010-04-30 (end)
# 15 + 31 + 30 = 76 days, 80% is 61 days
new_seasonal_atm1 <- atm1_data_ts %>% 
  filter_index("2010-02-16" ~ "2010-04-30")

train_atm1 <- atm1_data_ts %>% 
  filter_index("2010-02-16" ~ "2010-04-14")

# Fit the models
fit_atm1 <- train_atm1 %>%
  model(
    `Seasonal naive` = SNAIVE(Cash),
    Arima = ARIMA(Cash),
    ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
    ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
    ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
  )

report(fit_atm1)
## # A tibble: 5 × 12
##   ATM   .model  sigma2 log_lik   AIC  AICc   BIC ar_roots ma_roots    MSE   AMSE
##   <chr> <chr>    <dbl>   <dbl> <dbl> <dbl> <dbl> <list>   <list>    <dbl>  <dbl>
## 1 ATM1  Seaso… 575.        NA    NA    NA    NA  <NULL>   <NULL>      NA     NA 
## 2 ATM1  Arima  311.      -219.  447.  448.  455. <cpl [9… <cpl [0…    NA     NA 
## 3 ATM1  ETS_A… 327.      -281.  582.  586.  602. <NULL>   <NULL>     276.   330.
## 4 ATM1  ETS_M…   0.155   -308.  636.  641.  657. <NULL>   <NULL>    1046.  1437.
## 5 ATM1  ETS_D…   0.405   -329.  683.  692.  710. <NULL>   <NULL>   27524. 41523.
## # … with 1 more variable: MAE <dbl>

Now using only the 5 top performing models due to the seasonal nature of the data, the ARIMA model performs the best of the 4 models by evaluating the AICc of 447.

fit_atm1 %>% accuracy()
## # A tibble: 5 × 11
##   ATM   .model         .type      ME  RMSE   MAE    MPE  MAPE  MASE RMSSE    ACF1
##   <chr> <chr>          <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ATM1  Seasonal naive Trai…  -5.20   24.3  15.9 -214.  230.  1     1      0.444 
## 2 ATM1  Arima          Trai…   0.138  16.0  11.9  -97.6 143.  0.753 0.660 -0.127 
## 3 ATM1  ETS_Add        Trai…  -1.19   16.6  11.7  -56.5  68.7 0.739 0.684  0.0722
## 4 ATM1  ETS_Mult       Trai…  -2.40   32.3  20.8 -125.  145.  1.31  1.33   0.480 
## 5 ATM1  ETS_Damp       Trai… -10.4   166.   61.7  -13.8  68.2 3.89  6.82   0.412

The ARIMA model performs the best on the training set with an RMSE of 16.05, while the ETS Additive performs almost as well as the ARIMA model. The Seasonal Naive model performs third best and the ETS with dampening is clearly not doing well compared to the other four models.

# Generate forecasts for 16 days (two weeks), so we'll see if it picks up the seasonal nature
fc_atm1 <- fit_atm1 %>% forecast(h = "16 days")

fc_atm1 %>% accuracy(new_seasonal_atm1)
## # A tibble: 5 × 11
##   .model         ATM   .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr>          <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 Arima          ATM1  Test   7.71  12.8   9.50 40.9  43.2  0.599 0.527  0.0154 
## 2 ETS_Add        ATM1  Test   1.06  11.8   8.88  5.50 15.3  0.560 0.486 -0.337  
## 3 ETS_Damp       ATM1  Test  46.6   54.0  46.6  57.3  58.4  2.94  2.22   0.0345 
## 4 ETS_Mult       ATM1  Test   9.17  15.1  11.0  -2.95 26.2  0.694 0.621 -0.161  
## 5 Seasonal naive ATM1  Test   0.125  9.64  7.12  1.26  9.83 0.449 0.397 -0.00791

For the accuracy of the five models on the test set, the Seasonal Naive model actually performs the best with an RMSE of 9.64, while ARIMA and ETS Additive, also perform well.

# Plot forecasts against actual values
fc_atm1 %>%
  autoplot(filter_index(train_atm1, "2010-04-01" ~ "2010-04-30"), level = NULL) +
  autolayer(
    filter_index(new_seasonal_atm1, "2010-04-01" ~ "2010-04-30"),
    colour = "black"
  ) +
  labs(
    y = "Cash (in hundreds $USD)",
    title = "Forecasts for ATM1 Withdrawals"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

The plot of the test forecasts confirms the models are correctly capturing the seasonal nature of the data near the end of the dataset.

fit_atm1_arima<- train_atm1 %>%
  model(ARIMA(Cash))

# Check the residuals of ARIMA
fit_atm1_arima %>% gg_tsresiduals()

As above, I’ve displayed the residuals of the ARIMA model to assess the appropriateness of the model. The plot of Innovation residuals appears to follow white noise after the first 10 observations or so. The ACF plot indicates no correlation outside of the confidence intervals.

# Consider Box-Cox
lambda <- new_seasonal_atm1 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

new_seasonal_atm1 %>%
  autoplot(box_cox(Cash, lambda)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed ATM1 withdrawals with $\\lambda$ = ",
         round(lambda,2))))

Applying Box-Cox to see if I can squeeze out a little better performance on the test data.

# Forecast for May 2010
fit_atm1_bc <- train_atm1 %>%
  model(
    `Seasonal naive` = SNAIVE(box_cox(Cash, lambda)),
    ARIMA(box_cox(Cash, lambda)),
    ETS_Add = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A"))
  )

report(fit_atm1_bc)
## # A tibble: 3 × 12
##   ATM   .model    sigma2 log_lik   AIC  AICc   BIC ar_roots ma_roots   MSE  AMSE
##   <chr> <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl> <list>   <list>   <dbl> <dbl>
## 1 ATM1  Seasonal…   323.     NA    NA    NA    NA  <NULL>   <NULL>     NA    NA 
## 2 ATM1  ARIMA(bo…   175.   -206.  417.  418.  423. <cpl [1… <cpl [7…   NA    NA 
## 3 ATM1  ETS_Add     182.   -264.  548.  552.  568. <NULL>   <NULL>    154.  185.
## # … with 1 more variable: MAE <dbl>

The AICc of the ARIMA model with Box-Cox transformation has improved from 447.73 to 417.98. A promising sign.

fit_atm1_bc %>% accuracy()
## # A tibble: 3 × 11
##   ATM   .model         .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE    ACF1
##   <chr> <chr>          <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ATM1  Seasonal naive Train… -5.20  24.3  15.9 -214.  230.  1     1      0.444 
## 2 ATM1  ARIMA(box_cox… Train… -2.56  16.1  12.3 -105.  127.  0.773 0.661 -0.313 
## 3 ATM1  ETS_Add        Train… -1.11  16.6  12.0  -66.2  77.5 0.755 0.682  0.0836

The accuracy of the models with Box-Cox show little to no improvement over the models without transformations on the training set. The RMSE of the ARIMA model with the transformation actually increases from 16.04555 to 16.06325 as noted above.

# Generate forecasts for the next 16 days
fc_atm1_bc <- fit_atm1_bc %>% forecast(h = "16 days")

fc_atm1_bc %>% accuracy(new_seasonal_atm1)
## # A tibble: 3 × 11
##   .model          ATM   .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>           <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ARIMA(box_cox(… ATM1  Test  -6.25  11.5   9.35 -32.6  36.2 0.589 0.474 -0.360 
## 2 ETS_Add         ATM1  Test  -0.247 11.8   9.24 -24.8  35.4 0.582 0.487 -0.398 
## 3 Seasonal naive  ATM1  Test  -1.05   9.87  7.88 -21.8  29.4 0.497 0.406 -0.0900

The RMSE of the ARIMA model with Box-Cox transformation on the test data does improve the RMSE compared to the ARIMA model without transformation from 12.812248 to 11.53609.

# Plot forecasts against actual values
fc_atm1_bc %>%
  autoplot(filter_index(new_seasonal_atm1, "2010-04-01" ~ "2010-04-30"), level = NULL) +
  autolayer(
    filter_index(new_seasonal_atm1, "2010-04-01" ~ "2010-04-30"),
    colour = "black"
  ) +
  labs(
    y = "Cash (in hundreds $USD)",
    title = "Forecasts for ATM1 Withdrawals"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

The above plot shows the forecasts of the models with transformation along with the actual data. Similar to the plot from the models with transformations, the forecasts do follow the seasonal pattern of the data.

ATM1 Conclusion

Given the assignment is to forecast the data for May 2010 from the provided data, I believe the models generated from the final two and a half months is the appropriate approach. Clearly, a shift in the weekly pattern occurs in February 2010 and without any additional context or history, I assume that shift will continue through May 2010. As the forecast is only 1 month or approximately four more weeks from the end of the provided data, then I choose to believe that new pattern will persist for the next four weeks through May 2010.

ATM2

Data observations: ATM2: 2 Cash NA values

atm2_data_ts <- atm_data_ts %>%
  filter(ATM == 'ATM2')

atm2_data_ts %>%  autoplot(Cash)

# Calculate median value for ATM2
median <- median(atm2_data_ts$Cash, na.rm=TRUE)

# Set NAs to median
atm2_data_ts$Cash[is.na(atm2_data_ts$Cash)] <- median

atm2_data_ts %>%  autoplot(Cash)

atm2_data_ts %>%
  model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))
        ) %>% 
  components() %>% autoplot()

Same thing as ATM1, the seasonal nature of the data changes in the final few months. I’m gonna separate the forecast into just the final section where the weekly seasonal nature shifts.

atm2_data_ts %>% ACF() %>% autoplot()

atm2_data_ts %>% PACF() %>% autoplot()

atm2_data_ts %>% gg_season(Cash, period = "week") +
  theme(legend.position = "none") +
  labs(y="$ (in hundreds)", title="ATM Withdrawals")

atm2_data_ts %>%
  gg_subseries(period = "week") +
  labs(
    y = "$ (in hundreds)",
    title = "ATM Withdrawals"
  )

XXX

atm2_dcmp <- atm2_data_ts %>%
  model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))) 

atm2_dcmp %>% components() %>% autoplot()

components(atm2_dcmp) %>%
  as_tsibble() %>%
  filter_index("2009-05-01" ~ "2010-04-30") %>%
  autoplot(Cash, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")

components(atm2_dcmp) %>%
  as_tsibble() %>%
  filter_index("2010-02-16" ~ "2010-04-30") %>%
  autoplot(Cash, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")

ATM3

Now, for ATM3.

atm3_data_ts_all <- atm_data_ts %>%
  filter(ATM == 'ATM3')

summary(atm3_data_ts_all)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:365         Min.   : 0.0000  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 0.0000  
##  Median :2009-10-30   Mode  :character   Median : 0.0000  
##  Mean   :2009-10-30                      Mean   : 0.7206  
##  3rd Qu.:2010-01-29                      3rd Qu.: 0.0000  
##  Max.   :2010-04-30                      Max.   :96.0000

From summary output, the date range (2009-05-01 to 2010-04-03) matches expectations given the data from ATM1 and ATM2. The Cash column does not contain any missing data but does indicate curious aspects. The range is 0 to 96 which might be odd to have 0, but the median is also 0, which certainly indicates a curiosity about the data. The mean is also quite low, less than 1.

atm3_data_ts_all %>% autoplot(Cash) +
  labs(y = "Cash (in hundreds $USD)", title = "ATM3 Withdrawals")

A simply plot of the data clearly shows unexpected information. Only a few days at the end of the dataset contain anything above zero.

Remove data

atm3_data_ts <- atm3_data_ts_all %>%
  filter(ATM == 'ATM3', Cash > 0)

atm3_data_ts
## # A tsibble: 3 x 3 [1D]
## # Key:       ATM [1]
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2010-04-28 ATM3     96
## 2 2010-04-29 ATM3     82
## 3 2010-04-30 ATM3     85

By removing every date with a value equal to 0, only three dates remain - the final three dates of the dataset.

summary(atm3_data_ts)
##       DATE                ATM                 Cash      
##  Min.   :2010-04-28   Length:3           Min.   :82.00  
##  1st Qu.:2010-04-28   Class :character   1st Qu.:83.50  
##  Median :2010-04-29   Mode  :character   Median :85.00  
##  Mean   :2010-04-29                      Mean   :87.67  
##  3rd Qu.:2010-04-29                      3rd Qu.:90.50  
##  Max.   :2010-04-30                      Max.   :96.00

For those three observations, the range is 82 to 96 with a mean of 87.67.

atm3_data_ts %>% autoplot(Cash) +
  labs(y = "Cash (in hundreds $USD)", title = "ATM3 Withdrawals")

The plot confirms that only three dates contain a Cash amount greater than zero.

ATM3 Model

A decision much be made in how to approach this dataset. Of 365 days, 362 contain a Cash amount of zero and the final three dates contain a Cash amount meeting expectations. Without history or context of ATM3, I will make the assumption that ATM3 was broken or in some way not usable for everyday with a Cash amount of zero. My assumption would follow that starting on 2010-04-28 ATM3 was fixed or made active, and thus can be assumed working and active for the month of May 2010.

With so little data to build a model, I will choose the MEAN function based on the three dates containing Cash amounts above zero.

fit_atm3 <- atm3_data_ts %>% model(MEAN(Cash))

report(fit_atm3)
## Series: Cash 
## Model: MEAN 
## 
## Mean: 87.6667 
## sigma^2: 54.3333

The model returns an expected value of 87.67 as the mean, which was indicated in the summary of the data.

# Forecast for 31 days of May 2010
fc_atm3 <- fit_atm3 %>% forecast(h = "31 days")

fc_atm3 %>%
  autoplot(filter_index(atm3_data_ts_all, "2009-05-01" ~ "2010-05-31"), level = NULL) +
  autolayer(
    filter_index(atm1_data_ts, "2010-04-28" ~ "2010-05-31"),
    colour = "black"
  ) +
  labs(y="Cash (in hundreds $USD)", title="ATM3 Withdrawals with Forecast") +
  guides(colour = guide_legend(title = "Forecast"))

Not a very interesting plot of the forecasted data in relation to the ATM3 dataset, but given the minimal amount of data, the mean forecast is a constant line for the month of May 2010.

ATM3 Conclusion

There’s only 3 values with data above zero, so I’m assuming there was absolutely no activity but the ATM3 existed, or else it would have NA.

ATM4

atm4_data_ts <- atm_data_ts %>%
  filter(ATM == 'ATM4')

summary(atm4_data_ts)
##       DATE                ATM                 Cash          
##  Min.   :2009-05-01   Length:365         Min.   :    1.563  
##  1st Qu.:2009-07-31   Class :character   1st Qu.:  124.334  
##  Median :2009-10-30   Mode  :character   Median :  403.839  
##  Mean   :2009-10-30                      Mean   :  474.043  
##  3rd Qu.:2010-01-29                      3rd Qu.:  704.507  
##  Max.   :2010-04-30                      Max.   :10919.762
#atm4_data_ts

As expected, the summary for the ATM4 dataset falls on the dates 2009-05-01 through 2010-04-30. The Cash column indicats no missing values. The range of the Cash column is 1.563 to 10919.762 with a mean of 474.043. The low value is certainly low, but thankfully not zero, as addressed in ATM3. The high value is quite high compared to the mean and the third quartile. Perhaps an outlier or a few exists in the ATM4 data.

atm4_data_ts %>% autoplot(Cash) +
  labs(y="Cash (in hundreds $USD)", title="ATM4 Withdrawals")

The plot of the ATM4 data shows a clear outlier in February 2010.

atm4_data_ts %>%
  filter(Cash > 3000)
## # A tsibble: 1 x 3 [1D]
## # Key:       ATM [1]
##   DATE       ATM     Cash
##   <date>     <chr>  <dbl>
## 1 2010-02-09 ATM4  10920.

The day of the outlier turns out to be 2010-02-09.

Decomposition

atm4_dcmp <- atm4_data_ts %>%
  model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))) 

atm4_dcmp %>% components() %>% autoplot()

Given the weekly seasonal pattern worked for ATM1 and ATM2, let’s try decomposition on ATM4. Given the plot above, the outlier makes the plots near impossible to evaluate.

Impute outlier and missing data

Let’s the remove the outlier by imputing the median value in its place.

# Calculate median value for ATM
# Straightforward approach to impute data
median <- median(atm4_data_ts$Cash, na.rm=TRUE)

# Get rid of outlier from ATM4
atm4_data_ts$Cash[atm4_data_ts$Cash > 3000] <- median 

atm4_data_ts %>% autoplot(Cash) +
  labs(y="Cash (in hundreds $USD)", title="ATM4 Withdrawals")

By removing the outlier, the data does appear more approachable. Overall, I don’t detect a clear trend, perhaps a slight decrease. Given the data represents a single year, I don’t expect to find a cyclic component, but perhaps a weekly seasonal component exists. Let’s try the decomposition again, now excluding the outlier.

atm4_dcmp <- atm4_data_ts %>%
  model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))

atm4_dcmp %>% components() %>% autoplot()

The remainder component does play a larger factor than the seasonal component. As identified in ATM1 and ATM2, the trend component has minimal impact. Let’s try the seasonally adjusted plot.

components(atm4_dcmp) %>%
  as_tsibble() %>%
  filter_index("2009-05-01" ~ "2010-04-30") %>%
  autoplot(Cash, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")

The seasonally-adjusted trendline plot above does not convey any clear weekly pattern.

atm4_data_ts %>% ACF() %>% autoplot()

The ACF plot indicates correlation at lags 7, 14, and 21 as expected. Also, correlation appears at lag 10.

atm4_data_ts %>% PACF() %>% autoplot()

The PACF shows correlation at lags 7, 10, and 19.

atm4_data_ts %>% gg_season(Cash, period = "week") +
  labs(y="$ (in hundreds)", title="ATM4 Withdrawals")

I’ll admit the above plot of gg_season does not provide much value except perhaps for some nice colors.

atm4_data_ts %>%
  gg_subseries(period = "week") +
  labs(
    y = "$ (in hundreds)",
    title = "ATM Withdrawals"
  )

The `gg_subseries`` plot above does indicate a change in pattern on Tuesday, Wednesday, and Thursday, similar to that found in ATM1 and ATM2, but not quite as drastic and clear.

atm4_data_post0209 <- atm4_data_ts %>%
  filter_index("2010-02-10" ~ "2010-04-30")

atm4_data_post0209 %>%
  autoplot(Cash) +
  labs(y="Cash (in hundreds $USD)", title="ATM4 Withdrawals")

atm4_dcmp <- atm4_data_post0209 %>%
  model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))

atm4_dcmp %>% components() %>% autoplot()

components(atm4_dcmp) %>%
  as_tsibble() %>%
  filter_index("2010-02-10" ~ "2010-04-30") %>%
  autoplot(Cash, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")

Still too irregular. Try ARIMA and ETS and see what sort of results I can get.

Train and Test

Given I’m struggling to find a viable weekly seasonal pattern through visual inspection, I will apply the different seasonal models through the train and test approach.

# Create training set (1 year of data)
# 292 days for training, approx. 80% of data
train_atm4 <- atm4_data_ts %>% 
  filter_index("2009-05-01" ~ "2010-02-17")

# Fit the models
fit_atm4 <- train_atm4 %>%
  model(
    `Seasonal naive` = SNAIVE(Cash),
    Arima = ARIMA(Cash),
    ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
    ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
    ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
  )

report(fit_atm4)
## # A tibble: 5 × 12
##   ATM   .model        sigma2 log_lik   AIC  AICc   BIC ar_roots ma_roots     MSE
##   <chr> <chr>          <dbl>   <dbl> <dbl> <dbl> <dbl> <list>   <list>     <dbl>
## 1 ATM4  Seasonal …   2.09e+5     NA    NA    NA    NA  <NULL>   <NULL>       NA 
## 2 ATM4  Arima        1.25e+5  -2134. 4274. 4274. 4285. <cpl [7… <cpl [0…     NA 
## 3 ATM4  ETS_Add      1.05e+5  -2521. 5063. 5063. 5099. <NULL>   <NULL>   101739.
## 4 ATM4  ETS_Mult     5.14e-1  -2469. 4958. 4959. 4995. <NULL>   <NULL>   103349.
## 5 ATM4  ETS_Damp     4.88e-1  -2478. 4981. 4982. 5029. <NULL>   <NULL>   106682.
## # … with 2 more variables: AMSE <dbl>, MAE <dbl>
fit_atm4 %>% accuracy()
## # A tibble: 5 × 11
##   ATM   .model         .type       ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr> <chr>          <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ATM4  Seasonal naive Trai…   2.11    456.  347. -366.  422. 1     1     0.0620
## 2 ATM4  Arima          Trai…   0.0482  352.  295. -501.  535. 0.850 0.772 0.0819
## 3 ATM4  ETS_Add        Trai…  -9.31    319.  249. -421.  450. 0.717 0.699 0.110 
## 4 ATM4  ETS_Mult       Trai…  11.5     321.  251. -412.  444. 0.723 0.705 0.108 
## 5 ATM4  ETS_Damp       Trai… -18.3     327.  257. -434.  463. 0.741 0.716 0.0931
# Generate forecasts for 72 days
fc_atm4 <- fit_atm4 %>% forecast(h = "72 days")

fc_atm4 %>% accuracy(atm4_data_ts)
## # A tibble: 5 × 11
##   .model         ATM   .type      ME  RMSE   MAE    MPE  MAPE  MASE RMSSE     ACF1
##   <chr>          <chr> <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 Arima          ATM4  Test   -19.2   319.  266.  -777.  808. 0.768 0.699 -0.0357 
## 2 ETS_Add        ATM4  Test   -20.7   386.  319.  -918.  960. 0.919 0.847  0.0325 
## 3 ETS_Damp       ATM4  Test   -34.1   400.  330. -1017. 1059. 0.951 0.878  0.00763
## 4 ETS_Mult       ATM4  Test    -5.37  387.  321.  -913.  958. 0.925 0.849  0.00836
## 5 Seasonal naive ATM4  Test  -101.    520.  417. -1484. 1530. 1.20  1.14  -0.0989
# Plot forecasts against actual values
fc_atm4 %>%
  autoplot(filter_index(train_atm4, "2010-02-15" ~ "2010-04-30"), level = NULL) +
  autolayer(
    filter_index(atm4_data_ts, "2010-02-15" ~ "2010-04-30"),
    colour = "black"
  ) +
  labs(
    y = "Cash (in hundreds $USD)",
    title = "Forecasts for ATM4 Withdrawals"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

The above plot indicates the models are not following the weekly seasonal pattern. Let me try the truncated version similar to ATM1 and ATM2.

# ATM4 train on 2010-02-16 to 2010-04-30 (end)
# 15 + 31 + 30 = 76 days, 80% is 61 days
new_seasonal_atm4 <- atm4_data_ts %>% 
  filter_index("2010-02-16" ~ "2010-04-30")

train_atm4 <- atm4_data_ts %>% 
  filter_index("2010-02-16" ~ "2010-04-14")

# Fit the models
fit_atm4 <- train_atm4 %>%
  model(
    `Seasonal naive` = SNAIVE(Cash),
    Arima = ARIMA(Cash),
    ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
    ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
    ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
  )

report(fit_atm4)
## # A tibble: 5 × 12
##   ATM   .model        sigma2 log_lik   AIC  AICc   BIC ar_roots ma_roots     MSE
##   <chr> <chr>          <dbl>   <dbl> <dbl> <dbl> <dbl> <list>   <list>     <dbl>
## 1 ATM4  Seasonal …   2.09e+5     NA    NA    NA    NA  <NULL>   <NULL>   NA     
## 2 ATM4  Arima        1.11e+5   -419.  841.  841.  845. <cpl [0… <cpl [0… NA     
## 3 ATM4  ETS_Add      9.95e+4   -447.  913.  918.  934. <NULL>   <NULL>    8.41e4
## 4 ATM4  ETS_Mult     5.76e-1   -452.  924.  928.  944. <NULL>   <NULL>    4.57e5
## 5 ATM4  ETS_Damp     3.74e+0   -517. 1060. 1068. 1087. <NULL>   <NULL>    4.09e6
## # … with 2 more variables: AMSE <dbl>, MAE <dbl>
fit_atm4 %>% accuracy()
## # A tibble: 5 × 11
##   ATM   .model     .type         ME  RMSE   MAE     MPE  MAPE  MASE RMSSE   ACF1
##   <chr> <chr>      <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ATM4  Seasonal … Train… -1.05e+ 1  453.  349. -524.    578. 1     1     0.0769
## 2 ATM4  Arima      Train…  1.94e-10  330.  280. -747.    780. 0.803 0.728 0.0556
## 3 ATM4  ETS_Add    Train… -3.45e+ 1  290.  228. -204.    236. 0.652 0.640 0.0543
## 4 ATM4  ETS_Mult   Train… -1.75e+ 2  676.  391. -301.    328. 1.12  1.49  0.512 
## 5 ATM4  ETS_Damp   Train… -1.51e+ 1 2022. 1372.   -4.63  506. 3.93  4.46  0.803
# Generate forecasts for 16 days (two weeks), so we'll see if it picks up the seasonal nature
fc_atm4 <- fit_atm4 %>% forecast(h = "16 days")

fc_atm4 %>% accuracy(new_seasonal_atm4)
## # A tibble: 5 × 11
##   .model         ATM   .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE     ACF1
##   <chr>          <chr> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 Arima          ATM4  Test  -135.   278.  229.  -924.  937. 0.655 0.613 -0.308  
## 2 ETS_Add        ATM4  Test  -147.   274.  213.  -636.  645. 0.610 0.605  0.0327 
## 3 ETS_Damp       ATM4  Test   616.   690.  616.   670.  670. 1.77  1.52  -0.194  
## 4 ETS_Mult       ATM4  Test   -92.1  242.  189.  -618.  632. 0.542 0.534 -0.00529
## 5 Seasonal naive ATM4  Test  -244.   449.  370. -1178. 1201. 1.06  0.991  0.219
# Plot forecasts against actual values
fc_atm4 %>%
  autoplot(filter_index(train_atm4, "2010-04-01" ~ "2010-04-30"), level = NULL) +
  autolayer(
    filter_index(new_seasonal_atm4, "2010-04-01" ~ "2010-04-30"),
    colour = "black"
  ) +
  labs(
    y = "Cash (in hundreds $USD)",
    title = "Forecasts for ATM4 Withdrawals"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

fit_atm4_ets <- train_atm4 %>%
  model(ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")))

# Check the residuals of ARIMA
fit_atm4_ets %>% gg_tsresiduals()

ETS Multiplicative appears to perform best

Box-cox

# Consider Box-Cox
lambda <- new_seasonal_atm4 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

new_seasonal_atm4 %>%
  autoplot(box_cox(Cash, lambda)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed ATM4 withdrawals with $\\lambda$ = ",
         round(lambda,2))))

fit_atm4_bc <- train_atm4 %>%
  model(
    `Seasonal naive` = SNAIVE(box_cox(Cash, lambda)),
    ARIMA(box_cox(Cash, lambda)),
    ETS_Add = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A")),
    ETS_Mult = ETS(box_cox(Cash, lambda) ~ error("M") + trend("N") + season("M")),
  )

report(fit_atm4_bc)
## # A tibble: 4 × 12
##   ATM   .model    sigma2 log_lik   AIC  AICc   BIC ar_roots ma_roots   MSE  AMSE
##   <chr> <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl> <list>   <list>   <dbl> <dbl>
## 1 ATM4  Seasonal… 53.7       NA    NA    NA    NA  <NULL>   <NULL>    NA    NA  
## 2 ATM4  ARIMA(bo… 23.6     -176.  360.  361.  368. <cpl [1… <cpl [0…  NA    NA  
## 3 ATM4  ETS_Add   25.0     -206.  432.  437.  453. <NULL>   <NULL>    21.1  18.5
## 4 ATM4  ETS_Mult   0.135   -216.  451.  456.  472. <NULL>   <NULL>    30.5  31.0
## # … with 1 more variable: MAE <dbl>
fit_atm4_bc %>% accuracy()
## # A tibble: 4 × 11
##   ATM   .model          .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr> <chr>           <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ATM4  Seasonal naive  Train… -10.5  453.  349. -524.  578. 1     1      0.0769
## 2 ATM4  ARIMA(box_cox(… Train…  60.6  293.  239. -218.  257. 0.684 0.646 -0.0113
## 3 ATM4  ETS_Add         Train…  51.8  298.  218. -126.  158. 0.626 0.658  0.0641
## 4 ATM4  ETS_Mult        Train…  11.0  370.  283. -310.  348. 0.810 0.816  0.144
# Generate forecasts for the next 16 days
fc_atm4_bc <- fit_atm4_bc %>% forecast(h = "16 days")

fc_atm4_bc %>% accuracy(new_seasonal_atm4)
## # A tibble: 4 × 11
##   .model           ATM   .type    ME  RMSE   MAE    MPE  MAPE  MASE RMSSE   ACF1
##   <chr>            <chr> <chr> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ARIMA(box_cox(C… ATM4  Test  -174.  321.  246. -1011. 1025. 0.705 0.708 0.380 
## 2 ETS_Add          ATM4  Test  -207.  341.  275.  -737.  747. 0.788 0.752 0.0637
## 3 ETS_Mult         ATM4  Test  -317.  459.  363.  -922.  930. 1.04  1.01  0.401 
## 4 Seasonal naive   ATM4  Test  -616.  823.  705. -2079. 2094. 2.02  1.82  0.433
# Plot forecasts against actual values
fc_atm4_bc %>%
  autoplot(filter_index(new_seasonal_atm4, "2010-04-01" ~ "2010-04-30"), level = NULL) +
  autolayer(
    filter_index(new_seasonal_atm4, "2010-04-01" ~ "2010-04-30"),
    colour = "black"
  ) +
  labs(
    y = "Cash (in hundreds $USD)",
    title = "Forecasts for ATM4 Withdrawals"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

Box-Cox was not better, so go with model without transformation

Forecast

Overall daily totals across all four ATMs

Total by day and see if there is an overall relationship to be understood

atm_data_all <- bind_rows(atm1_data_ts,
                          atm2_data_ts,
                          atm3_data_ts_all,
                          atm4_data_ts)

atm_data_total_by_day <- atm_data_all %>%
  index_by(DATE) %>%
  summarize(Cash_Total = sum(Cash))

#atm_data_total_by_day

atm_data_total_by_day %>% autoplot(Cash_Total)

# Seasonally adjusted plot
atm_dcmp <- atm_data_total_by_day %>%
  model(stl = STL(Cash_Total ~ trend(window=Inf) + season(period=7, window="periodic"))) 

atm_dcmp %>% components() %>% autoplot()

components(atm_dcmp) %>%
  as_tsibble() %>%
  autoplot(Cash_Total, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")

Part B – Residential Customer Power

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

# Read in data
power_data_raw <- read_excel("data/ResidentialCustomerForecastLoad-624.xlsx")

# Change column name to 'Month'
names(power_data_raw)[names(power_data_raw) == 'YYYY-MMM'] <- 'Month'

# Display first 5 rows for visual inspection
#head(power_data_raw)

# Display summary for initial assessment
summary(power_data_raw)
##   CaseSequence      Month                KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

The summary of the residential power data indicates 192 months, which is exactly 16 years. The KWH column shows 1 missing value with a range of 770523 to 10655730. The minimum value being so distanct from the first quartile value may indicate an outlier. The CaseSequence column appears to be a integer unique identifer for each month, so will ignore moving forward.

# Display dimensions for assessment
# should be 16 years of data 1998-2013
# Yep, 192/12 = 16
dim(power_data_raw)
## [1] 192   3

The dimensions output confirms the summary output, 192 observations with three columns.

power_data_ts <- power_data_raw %>%
  mutate(Month = yearmonth(Month)) %>%
  mutate(KWH = KWH/1e3) %>% # In thousands
  select(-c(CaseSequence)) %>%
  as_tsibble(index = Month)

# Output first 5 rows of tsibble
#head(power_data_ts)

# Inital plot of data
power_data_ts %>%
  autoplot(KWH)

After converting the raw data to a tsibble with the Month as the index, we do see a missing value perhaps sometime in 2008 and an outlier in early 2010.

power_data_ts$Month[is.na(power_data_ts$KWH)]
## <yearmonth[1]>
## [1] "2008 Sep"

Month with missing value is “2008 Sep”.

power_data_ts$Month[power_data_ts$KWH < 3000]
## <yearmonth[2]>
## [1] NA         "2010 Jul"

Month with outlier is “2010 Jul”.

First observations, data is Monthly, so I’d expect a seasonal component. 1 month is missing KWH has 1 NA value (2008 Sep) Considering imputing with median Sep Value Outlier (with very small value in July 2010) Considering imputing with median Jul Value

Impute data

Given the data is monthly and appears to follow a clear seasonal pattern, I will impute the missing value and the outlier value by calculating the median value for each month for imputation.

First, I will find the median value for all September entries.

# I want all the September values
sep_data <- power_data_ts %>%
  filter(str_detect(Month, "Sep"))

sep_data <- as_tibble(sep_data)

sep_kwh_med <- sep_data %>% 
  summarise(Median = median(KWH, na.rm = TRUE))

sep_kwh_med
## # A tibble: 1 × 1
##   Median
##    <dbl>
## 1  7667.
# Median 7666.97

power_data_ts[129, 2] <- sep_kwh_med

Median value for all other September values is 7666.97.

# I want all the July values
jul_data <- power_data_ts %>%
  filter(str_detect(Month, "Jul"))

jul_data <- as_tibble(jul_data)

jul_kwh_med <- jul_data %>% 
  summarise(Median = median(KWH, na.rm = TRUE))

jul_kwh_med
## # A tibble: 1 × 1
##   Median
##    <dbl>
## 1  7679.
# Median 7678.623   

power_data_ts[151, 2] <- jul_kwh_med

Median value for all other July values is 7695.942.

power_data_ts %>% autoplot(KWH)

Above plot shows the residential customer power data with imputations. Clearly a seasonal pattern exists along with an increasing variance near the end of the plot. Perhaps a Box-Cox transformation could be useful. Also, an slight increasing trend does appear, at least for the second half of the plot.

Decomposition

power_dcmp <- power_data_ts %>%
  model(stl = STL(KWH ~ trend(window=Inf) + season(period=12, window="periodic"))) 
power_dcmp %>% components() %>% autoplot()

The decomposition of the data indicates an increasing trend along with a seasonal pattern and remainder component that appear equal in weight.

components(power_dcmp) %>%
  as_tsibble() %>%
  autoplot(KWH, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "KWH (in thousands)", title = "Seasonally Adjusted Trendline")

The plot of the seasonally adjusted data indicates the seasonal pattern does persist through the full dataset with a few exceptions throughout.

power_data_ts %>% ACF() %>% autoplot()

ACF plot shows the highest correlation at lag 12, which is expected for monthly data over the course of several years.

power_data_ts %>% PACF() %>% autoplot()

PACF plot indicates autocorrelation at lags 1, 2, 5, 11, and 12, interesting result for monthly data in which I expected the PACF plot to show the higest correlation at lag 12.

power_data_ts %>% gg_season(KWH, period = "year") +
  labs(y="KWH (in thousands)", title="Residential Customer Power Usage")

The gg_season plot certainly shows a yearly pattern with low usage in the spring and fall and higher usage in the summer and winter.

power_data_ts %>%
  gg_subseries(period = "year") +
  labs(y="KWH (in thousands)", title="Residential Customer Power Usage")

TODO: The gg_subseries plot indicates:

TODO: Too soon with this

# Seasonal differencing
power_data_ts %>%
  gg_tsdisplay(difference(KWH, 12),
               plot_type='partial', lag=36) +
  labs(title="Seasonally differenced", y="")

Train and Test Model

# Train on 154 months, which is 12 years and 8 months (80% of the total)
# Total is 192 months (16 years)

train_power_data <- power_data_ts %>% 
  filter_index("1998-Jan" ~ "2010-Oct")

Calculate lambda for the Box-Cox transformation

lambda <- power_data_ts %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

train_power_data %>%
  autoplot(box_cox(KWH, lambda)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed ATM1 withdrawals with $\\lambda$ = ",
         round(lambda,2))))

# Fit the models
fit_power_bc <- train_power_data %>%
  model(
    Naive = NAIVE(box_cox(KWH, lambda)),
    `Seasonal naive` = SNAIVE(box_cox(KWH, lambda)),
    `Random walk` = RW(box_cox(KWH, lambda)),
    Arima = ARIMA(box_cox(KWH, lambda)),
    ETS_Add = ETS(box_cox(KWH, lambda) ~ error("A") + trend("A") + season("A")),
    ETS_Mult = ETS(box_cox(KWH, lambda) ~ error("M") + trend("A") + season("M")),
    ETS_Damp = ETS(box_cox(KWH, lambda) ~ error("M") + trend("Ad") + season("M"))
  )

report(fit_power_bc)
## # A tibble: 7 × 11
##   .model    sigma2 log_lik   AIC  AICc   BIC ar_roots ma_roots      MSE     AMSE
##   <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl> <list>   <list>      <dbl>    <dbl>
## 1 Naive    1.21e-3     NA    NA    NA    NA  <NULL>   <NULL>   NA       NA      
## 2 Seasona… 3.58e-4     NA    NA    NA    NA  <NULL>   <NULL>   NA       NA      
## 3 Random … 1.21e-3     NA    NA    NA    NA  <NULL>   <NULL>   NA       NA      
## 4 Arima    2.10e-4    398. -785. -785. -771. <cpl [1… <cpl [1… NA       NA      
## 5 ETS_Add  2.22e-4    268. -503. -498. -451. <NULL>   <NULL>    1.99e-4  2.05e-4
## 6 ETS_Mult 1.33e-5    269. -504. -500. -452. <NULL>   <NULL>    1.97e-4  2.04e-4
## 7 ETS_Damp 1.35e-5    268. -501. -496. -446. <NULL>   <NULL>    1.99e-4  2.08e-4
## # … with 1 more variable: MAE <dbl>
fit_power_bc %>% accuracy()
## # A tibble: 7 × 10
##   .model         .type       ME  RMSE   MAE    MPE  MAPE  MASE RMSSE   ACF1
##   <chr>          <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 Naive          Training -6.45 1289. 1069. -2.21  17.2  1.92  1.80  0.182 
## 2 Seasonal naive Training 69.0   715.  557.  0.478  8.79 1     1     0.267 
## 3 Random walk    Training -6.45 1289. 1069. -2.21  17.2  1.92  1.80  0.182 
## 4 Arima          Training  7.10  518.  397. -0.309  6.34 0.713 0.725 0.0885
## 5 ETS_Add        Training 14.8   542.  422. -0.447  6.69 0.758 0.758 0.324 
## 6 ETS_Mult       Training 14.0   539.  414. -0.480  6.55 0.744 0.754 0.311 
## 7 ETS_Damp       Training 53.2   543.  414.  0.167  6.50 0.744 0.760 0.296

The best RMSE on the training data was attained by the ARIMA model followed by the three ETS models.

# Generate forecasts for 38 months, so we'll see if it picks up the seasonal nature
fc_power_bc <- fit_power_bc %>% forecast(h = "38 months")

fc_power_bc %>% accuracy(power_data_ts)
## # A tibble: 7 × 10
##   .model         .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE  ACF1
##   <chr>          <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima          Test    621. 1034.  757.   7.25  9.42  1.36  1.45 0.229
## 2 ETS_Add        Test    633. 1084.  798.   7.33  9.99  1.43  1.52 0.175
## 3 ETS_Damp       Test    691. 1107.  820.   8.13 10.2   1.47  1.55 0.172
## 4 ETS_Mult       Test    590. 1038.  751.   6.77  9.36  1.35  1.45 0.164
## 5 Naive          Test  -1348. 2580. 2155. -23.4  32.0   3.87  3.61 0.689
## 6 Random walk    Test  -1348. 2580. 2155. -23.4  32.0   3.87  3.61 0.689
## 7 Seasonal naive Test    303. 1018.  850.   2.77 11.0   1.53  1.42 0.419

The best RMSE is the Seasonal Naive model followed closely by the Arima model and the ETS model with Multiplicative.

# Plot forecasts against actual values
fc_power_bc %>%
  autoplot(filter_index(power_data_ts, "2010-Nov" ~ "2013-Dec"), level = NULL) +
  autolayer(
    filter_index(power_data_ts, "2010-Nov" ~ "2013-Dec"),
    colour = "black"
  ) +
  labs(
    y = "KWH (in thousands)",
    title = "Power Forecast"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

Forecast

Let’s forecast for the year 2014 with the three best performing models.

# Generate forecasts for 50 months, which covers the end of the provided time series and the next 12 months of 2014
fc_power_bc14 <- fit_power_bc %>% forecast(h = "50 months")

fc_power_bc14 %>%
  autoplot(filter_index(power_data_ts, "2010-Nov" ~ "2014-Dec"), level = NULL) +
  autolayer(
    filter_index(power_data_ts, "2010-Nov" ~ "2014-Dec"),
    colour = "black"
  ) +
  labs(
    y = "KWH (in thousands)",
    title = "Power Forecast"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

Go with the ARIMA model.

Part C – Waterflow (optional)

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

library(lubridate)
# Read in data
pipe1_data_raw <- read_excel("data/Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
pipe2_data_raw <- read_excel("data/Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))

# From: https://stackoverflow.com/questions/53635818/convert-datetime-from-excel-to-r

pipe1_data_raw$`Date Time` <- as.POSIXct(pipe1_data_raw$`Date Time`,
                              origin="1899-12-30",
                              tz="GMT")

pipe2_data_raw$`Date Time` <- as.POSIXct(pipe2_data_raw$`Date Time`,
                              origin="1899-12-30",
                              tz="GMT")


# Initial output to see data
head(pipe1_data_raw)
## # A tibble: 6 × 2
##   `Date Time`         WaterFlow
##   <dttm>                  <dbl>
## 1 2015-10-23 00:24:06     23.4 
## 2 2015-10-23 00:40:02     28.0 
## 3 2015-10-23 00:53:51     23.1 
## 4 2015-10-23 00:55:40     30.0 
## 5 2015-10-23 01:19:17      6.00
## 6 2015-10-23 01:23:58     15.9
head(pipe2_data_raw)
## # A tibble: 6 × 2
##   `Date Time`         WaterFlow
##   <dttm>                  <dbl>
## 1 2015-10-23 01:00:00      18.8
## 2 2015-10-23 02:00:00      43.1
## 3 2015-10-23 03:00:00      38.0
## 4 2015-10-23 04:00:00      36.1
## 5 2015-10-23 05:00:00      31.9
## 6 2015-10-23 06:00:00      28.2
summary(pipe1_data_raw)
##    Date Time                     WaterFlow     
##  Min.   :2015-10-23 00:24:06   Min.   : 1.067  
##  1st Qu.:2015-10-25 11:21:35   1st Qu.:13.683  
##  Median :2015-10-27 20:07:30   Median :19.880  
##  Mean   :2015-10-27 20:49:15   Mean   :19.897  
##  3rd Qu.:2015-10-30 08:24:51   3rd Qu.:26.159  
##  Max.   :2015-11-01 23:35:43   Max.   :38.913
summary(pipe2_data_raw)
##    Date Time                     WaterFlow     
##  Min.   :2015-10-23 01:00:00   Min.   : 1.885  
##  1st Qu.:2015-11-02 10:45:00   1st Qu.:28.140  
##  Median :2015-11-12 20:30:00   Median :39.682  
##  Mean   :2015-11-12 20:30:00   Mean   :39.556  
##  3rd Qu.:2015-11-23 06:15:00   3rd Qu.:50.782  
##  Max.   :2015-12-03 16:00:00   Max.   :78.303
dim(pipe1_data_raw)
## [1] 1000    2
dim(pipe2_data_raw)
## [1] 1000    2
# Calculate the average flow per hour for pipe1
pipe1_data_raw$Date <- as.Date(pipe1_data_raw$`Date Time`)
pipe1_data_raw$Time <- hour(pipe1_data_raw$`Date Time`) + 1

pipe1_data_raw <- pipe1_data_raw %>%
  group_by(Date, Time) %>%
  summarise(WF_AVG = mean(WaterFlow)) %>% ungroup()

pipe1_data_raw$`Date Time` <- with(pipe1_data_raw, ymd_h(paste(Date, Time)))

pipe1_data_raw <- pipe1_data_raw %>% 
  select(c(`Date Time`, WF_AVG)) %>%
  rename(WaterFlow = WF_AVG)

summary(pipe1_data_raw)
##    Date Time                     WaterFlow     
##  Min.   :2015-10-23 01:00:00   Min.   : 8.923  
##  1st Qu.:2015-10-25 11:45:00   1st Qu.:17.033  
##  Median :2015-10-27 23:30:00   Median :19.784  
##  Mean   :2015-10-27 23:38:53   Mean   :19.893  
##  3rd Qu.:2015-10-30 11:15:00   3rd Qu.:22.789  
##  Max.   :2015-11-02 00:00:00   Max.   :31.730
# Define as tsibble
pipe1_data_ts <- pipe1_data_raw %>%
  as_tsibble(index = `Date Time`)

pipe1_data_ts %>% autoplot()

pipe2_data_ts <- pipe2_data_raw %>%
  as_tsibble(index = `Date Time`)

pipe2_data_ts %>% 
  filter_index("2015-10-23" ~ "2015-11-01") %>%
  autoplot() +
  autolayer(pipe1_data_ts)

#30#